Improving Precision of Keywords Extracted From Persian Text Using Word2Vec Algorithm

Authors

Amiri Jezeh, Ali (Imam Hossein University)

Hasni Ahangar, Mohammad Reza (Imam Hossein University)

Abstract:

Keywords extracted automatically by a model can convey the main concepts of a text without human intervention. Keywords are important words that describe the text and play a key role in understanding its content quickly and accurately. The purpose of keyword extraction is to identify the subject and main content of a text in the shortest possible time. It plays an important role in text summarization, document labeling, information retrieval, and topic extraction. For example, condensing a large text into a shorter one is difficult, but the keywords of a text can make the reader aware of its topics. Identifying keywords with common methods is time-consuming and costly. Keyword extraction methods fall into two classes: supervised and unsupervised. In general, the extraction process works as follows: the text is first split into smaller units (words), redundant words are removed, the remaining words are weighted, and the keywords are selected from among them. The method proposed in this paper is supervised. We first compute a per-document word correlation matrix using a feed-forward neural network and the Word2Vec algorithm. Then, using the correlation matrix and a small initial list of keywords, we extract the most similar words as a list of nearest neighbors. Next, we sort this list in descending order and select different percentages of words from the top of the list. For each percentage, we repeat the neural network training, the construction of the correlation matrix, and the extraction of the nearest-neighbor list 10 times. Finally, we compute the average precision, recall, and F-measure.
We continue this process until the evaluation yields the best results. The results show that acceptable results are obtained when up to 40% of the words are selected from the top of the nearest-neighbor list. The algorithm was tested on a corpus of 800 news items whose keywords had been extracted manually, and the experimental results show that the accuracy of the proposed method is 78%.
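
The ranking and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-dimensional vectors are made-up stand-ins for trained Word2Vec embeddings, cosine similarity is assumed as the similarity measure, and the function names (`rank_candidates`, `select_top`, `precision_recall_f1`), the seed list, and the gold keywords are all hypothetical. The sketch also skips the training step and the averaging over 10 runs, since the toy vectors are fixed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(vectors, seeds):
    """Rank non-seed words by their highest similarity to any seed keyword.

    Assumption: seed keywords themselves are excluded from the candidates.
    """
    scores = {
        word: max(cosine(vec, vectors[s]) for s in seeds)
        for word, vec in vectors.items()
        if word not in seeds
    }
    return sorted(scores, key=scores.get, reverse=True)  # descending order


def select_top(ranked, fraction):
    """Keep the leading fraction of the ranked list (at least one word)."""
    return ranked[: max(1, round(len(ranked) * fraction))]


def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-measure of predicted keywords against gold."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy embeddings (made-up values, standing in for Word2Vec output).
toy_vectors = {
    "economy": (1.0, 0.0),
    "bank": (0.9, 0.1),
    "inflation": (0.8, 0.2),
    "market": (0.7, 0.3),
    "football": (0.0, 1.0),
    "weather": (0.1, 0.9),
}
seeds = {"economy"}        # hypothetical initial keyword list
gold = {"bank", "market"}  # hypothetical manually assigned keywords

ranked = rank_candidates(toy_vectors, seeds)
selected = select_top(ranked, 0.40)  # the 40% cut-off mentioned in the abstract
precision, recall, f1 = precision_recall_f1(selected, gold)
print(selected, precision, recall, f1)
```

In a setting closer to the one described, the vectors would come from a Word2Vec model trained on each document, and the three scores would be averaged over repeated training runs for each cut-off percentage.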


Similar resources

Text Summarization Extraction System (TSES) Using Extracted Keywords

A new technique for producing a summary of an original text is investigated in this paper. The system develops several approaches to this problem that give high-quality results. The model consists of four stages. The preprocessing stages convert the unstructured text into structured text. In the first stage, the system removes the stop words, parses the text and assigns a POS tag to each word in the tex...


Improving Persian Text Classification and Clustering Using Persian Thesaurus

This paper proposes an innovative approach to improve the classification performance of Persian texts. The proposed method uses a thesaurus as a knowledge source to obtain more representative word frequencies in the corpus. Two types of word relationships are considered in the thesaurus we use. This is the first attempt to use a Persian thesaurus in the field of Persian information retrieval. Ex...


A Study on Automatically Extracted Keywords in Text Categorization

This paper presents a study on whether and how automatically extracted keywords can be used to improve text categorization. In summary, we show that a higher performance (as measured by micro-averaged F-measure on a standard text categorization collection) is achieved when the full-text representation is combined with the automatically extracted keywords. The combination is obtained by giving highe...


Text Generation from Keywords

We describe a method for generating sentences from “keywords” or “headwords”. This method consists of two main parts, candidate-text construction and evaluation. The construction part generates text sentences in the form of dependency trees by using complementary information to replace information that is missing because of a “knowledge gap” and other missing function words to generate natural ...


Automatic Persian text summarizer using simulated annealing and genetic algorithm

Automatic text summarization is the process of reducing the volume of text documents with computer programs, producing a summary that keeps the key terms of the documents. Due to the cumulative growth of information and data, automatic text summarization techniques need to be applied in various domains. The approach helps in decreasing the quantity of the document without changing the context of...


Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora

In this article, we present an automated approach to extracting English-Bengali parallel fragments of text from comparable corpora created using Wikipedia documents. Our approach exploits the multilingualism of Wikipedia. Most importantly, this approach does not need any domain-specific corpus. We have been able to improve the BLEU score of an existing domain specific EnglishBenga...



Journal title

پردازش علائم و داده ها (Signal and Data Processing)

Volume 18, Issue 1

Pages 51-60

Publication date 2021-05


